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IMPROVED VARIABLE SELECTION WITH FORWARD-LASSO 
ADAPTIVE SHRINKAGE 1 

By Peter Radchenko and Gareth M. James 
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Recently, considerable interest has focused on variable selection 
methods in regression situations where the number of predictors, p, 
is large relative to the number of observations, n. Two commonly 
applied variable selection approaches are the Lasso, which computes 
highly shrunk regression coefficients, and Forward Selection, which 
uses no shrinkage. We propose a new approach, "Forward-Lasso Adap- 
tive SHrinkage" (FLASH), which includes the Lasso and Forward Se- 
lection as special cases, and can be used in both the linear regression 
and the Generalized Linear Model domains. As with the Lasso and 
Forward Selection, FLASH iteratively adds one variable to the model 
in a hierarchical fashion but, unlike these methods, at each step ad- 
justs the level of shrinkage so as to optimize the selection of the next 
variable. We first present FLASH in the linear regression setting and 
show that it can be fitted using a variant of the computationally 
efficient LARS algorithm. Then, we extend FLASH to the GLM do- 
main and demonstrate, through numerous simulations and real world 
data sets, as well as some theoretical analysis, that FLASH generally 
outperforms many competing approaches. 

1. Introduction. Consider the traditional linear regression model 

p 

(1) Y i = pQ + y ^X ij Pj+s i , i = l,...,n, 

i=i 

with p predictors and n observations. Recently attention has focused on 
the scenario where p is large relative to n. In this situation there are many 
methods that outperform ordinary least squares (OLS) [Frank and Fried- 
man (1993)]. One common approach is to assume that the true number of 
regression coefficients, that is, the number of nonzero /3j's, is small, in which 
case estimation results can be improved by performing variable selection. 
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Many classical variable selection methods, such as Forward Selection, have 
been proposed. More recently, interest has focused on an alternative class of 
penalization methods, the most well known of which is the Lasso [Tibshi- 
rani (1996)]. In addition to minimizing the usual sum of squares, the Lasso 
imposes an L\ penalty on the coefficients, which has the effect of automati- 
cally performing variable selection by setting certain coefficients to zero and 
shrinking the remainder. While the shrinkage approach can work well, it 
has been shown that in sparse settings the Lasso often over-shrinks the co- 
efficients. Numerous alternatives and extensions have been suggested. A few 
examples include SCAD [Fan and Li (2001)], the Elastic Net [Zou and Hastie 
(2005)], the Adaptive Lasso [Zou (2006)], the Dantzig selector [Candes and 
Tao (2007)], the Relaxed Lasso [Meinshausen (2007)], VISA [Radchenko and 
James (2008)] and the Double Dantzig [James and Radchenko (2009)]. 

The Lasso has been made particularly appealing by the advent of the 
LARS algorithm [Efron et al. (2004)] which provides a highly efficient means 
to simultaneously produce the set of Lasso fits for all values of the tuning 
parameter. The LARS algorithm starts with an empty set of variables and 
then adds the predictor, say, X,-, most highly correlated with the response. 
Next, the corresponding estimated coefficient, /5L-, is adjusted in the direc- 
tion of the least squares solution. The algorithm "breaks" when the absolute 
correlation between Xj and the residual vector, Y — A/3, is reached by the 
corresponding correlation for another predictor. The new predictor, say, X&, 
is then added to the model, and the coefficients f3j and are increased to- 
ward their joint least squares solution until some other variable's correlation 
matches those of Xj and X*., at which point the new variable is also added 
to the model. This process continues until all the correlations have reached 
zero, which corresponds to the ordinary least squares solution. 

By comparison, a common version of Forward Selection also starts with 
an empty model and then iteratively adds to the model the variable most 
highly correlated with the current residual vector. Next, the residuals are 
recomputed using the ordinary least squares solution, based on the cur- 
rently selected variables. This algorithm repeats until all the variables have 
been added to the model. In comparing Forward Selection with LARS, one 
observes that the main difference is that the former method drives the re- 
gression estimates for the currently selected variables all the way to the least 
squares solution, while LARS only moves them part way in this direction. 
Hence, the Lasso estimates the residual vector using shrunk regression coef- 
ficients, while Forward Selection uses unshrunk estimates. Which approach 
is superior? In Section 2 we show that, even for toy examples with no noise 
in the response, neither universally dominates the other. In some situations 
the Lasso's high level of shrinkage produces the best results, while in other 
cases unshrunk estimates work better. 
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In this paper we suggest viewing the Lasso and Forward Selection as two 
extremes on a continuum of possible model selection rules. Instead of select- 
ing candidate models using either highly shrunk or else completely unshrunk 
coefficients, we propose a methodology that can adaptively adjust the level 
of shrinkage at each step in the algorithm. We call our approach "Forward- 
Lasso Adaptive SHrinkage" (FLASH). As with LARS, our algorithm selects 
the variable most highly correlated with the residuals and drives the selected 
coefficients toward the least squares solution. However, instead of stopping 
at the highly shrunk Lasso point or the zero shrinkage Forward Selection 
point, FLASH uses the data to adaptively choose, at each step, the opti- 
mal level of shrinkage before selecting the next variable. FLASH includes 
Forward Selection and the Lasso as special cases, yet has the same order of 
computational cost as the Lasso. After introducing FLASH in the linear re- 
gression setting, we then extend it to the Generalized Linear Models (GLM) 
domain. Thus, FLASH can also be used to perform variable selection in high 
dimensional classification problems using, for example, a logistic regression 
framework. This significantly expands the range of problems that FLASH 
can be applied to. We show through extensive simulation studies, as well as 
theoretical arguments, that FLASH significantly outperforms Forward Se- 
lection, the Lasso and many alternative methods, in both the regression and 
the GLM domains. 

Our paper is structured as follows. In Section 2 we demonstrate that nei- 
ther Forward Selection nor the Lasso universally dominate each other. We 
present the FLASH methodology in the linear regression setting and outline 
an algorithm for efficiently constructing its path. Some theoretical properties 
of FLASH are also discussed. Then in Section 3 we present a detailed simu- 
lation study to examine the practical performance of FLASH in comparison 
to Forward Selection, the Lasso and other competing methods. FLASH is 
extended to the GLM setting in Section 4 and further simulation results are 
provided. In Section 5 FLASH is demonstrated on several real world data 
sets, predicting baseball salaries, real estate prices and whether an internet 
image is an advertisement. These data sets all have many predictors, up to 
p = 1430, and involve both linear regression and GLM scenarios. We end 
with a discussion in Section 6. 

2. Methodology. Using suitable location and scale transformations, we 
can standardize the data so that the response, Y, and each predictor, Xj, 
are mean zero with ||Xj|| = 1. Throughout the paper we assume that this 
standardization holds. However, all numerical results are presented on the 
original scale of the data. 

2.1. Lasso versus Forward Selection. As discussed in the introduction, 
both the LARS implementation of the Lasso and the Forward Selection 
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Fig. 1. Plots showing regions where the Lasso and Forward Selection will identify the 
correct model for different correlation structures. Points above the dashed lines correspond 
to the Lasso regions. Points between the dash dot lines correspond to Forward Selection. 
The solid lines provide the regions of feasible correlation combinations. 

algorithm choose the variable with the highest absolute correlation and then 
drive the selected regression coefficients toward the least squares solution. 
The key difference is that Forward Selection produces unshrunk estimates 
by utilizing the least squares solution while the Lasso uses shrunk estimates 
by only driving the coefficients part way. Which approach works better? It 
is not hard to show that even in simple settings neither approach dominates 
the other. 

Consider, for example, a scenario involving a linear model with two signal 
predictors, one noise variable and no error term. Denote by psi,S2 the cor- 
relation between the signal predictors and let ps it N denote the correlation 
between the ith signal and jth noise variable. Provided the coefficient for 
the first signal variable is large enough, this variable is the one most highly 
correlated with the response, thus it is the first selected by both the Lasso 
and Forward Selection. In this setting one can directly calculate the values 
of ps 1 ,s 2 ) PSi,Ni an d Ps 2 ,N 1 where the Lasso or Forward Selection selects the 
"correct" set of variables. Figure 1 provides an illustration for three differ- 
ent values of pSi,Ni- The regions between the dash dot curves correspond 
to the values of ps!,s 2 an d PSi,Ni where Forward Selection will identify the 
correct model. Alternatively, the regions above the dashed curve represent 
the same situations for the Lasso. The solid lines encompass the regions of 
feasible correlation combinations. Even in this simplified example it is clear 
that there are many cases where Forward Selection succeeds and the Lasso 
fails, and vice versa. 
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Fig. 2. Absolute correlations of the two signal variables (black and gray solid) and two 
noise variables (black and gray dashed) for different values of ps 1 N 1 and psxN 2 in the 
example considered in Section 2.1. The first dotted vertical indicates the Lasso break point, 
and the second dotted vertical corresponds to Forward Selection. The line (other than 
black solid) with the highest value at the break point indicates the variable selected by the 
corresponding method. The Lasso succeeds only in (a) and (c), and Forward only in (a) 
and (b). 



Figure 2 graphically illustrates how the Lasso, Forward Selection or both 
methods could fail, using the same simple setup with one additional noise 
variable. For each plot the four lines represent the absolute correlation be- 
tween the corresponding variable and the residual vector; solid lines for signal 
variables and dashed lines for noise variables. The left-hand side of the plot 
corresponds to the null model with all coefficients set to zero, and the lines 
show how the correlations change as coefficients are adjusted toward the 
least squares solution. Each plot represents different values of ps lt Ni and 
PSi,n 2 - The values for the other relevant parameters are fixed for all four 
plots at fix = 2,/3 2 = l,P5i,5 2 =0.5,ps 2) jvi = Ps 2 ,n 2 = 0.8. 

In all four plots the black solid line, representing the first signal variable, 
has the maximal correlation for the null model, so both the Lasso and For- 
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ward Selection choose this variable first and drive its coefficient toward the 
least squares solution. However, the Lasso stops when the black line inter- 
sects with one of the other variables and adds that variable next, the first 
vertical dotted line in each plot, while Forward Selection drives the black line 
to zero, that is, the least squares solution, and then selects the variable with 
the maximal correlation, the second dotted line. For a method to choose the 
correct model it must select the second signal variable, represented by the 
gray solid line. In Figure 2(a) the gray solid line is the highest at both the 
Lasso and Forward Selection stopping points, so both methods choose the 
correct model. However, in Figure 2(b) the Lasso selects the black dashed 
noise variable, while Forward Selection still chooses the correct model. Alter- 
natively, in Figure 2(c) the Lasso correctly selects the gray signal variable, 
while Forward Selection chooses the gray dashed noise variable. Finally, in 
Figure 2(d) both the Lasso and Forward Selection incorrectly select noise 
variables. 

2.2. An adaptive shrinkage methodology. A key observation from Fig- 
ure 2 is that in all four plots the correct solid grey signal variable has the 
maximal correlation for at least some levels of shrinkage, even in situations 
where the Lasso and Forward Selection fail to identify the correct model. 
This example illustrates that choosing the variable most highly correlated 
with the residuals can work well provided the correct level of shrinkage is 
used. This observation motivates our "Forward-Lasso Adaptive SHrinkage" 
(FLASH) methodology. 

Like the Lasso and Forward Selection, FLASH begins with the null model 
containing no variables and then implements the following procedure: 

1. At each step add to the model the variable most highly correlated with 
the current residual vector. 

2. Move the coefficients for the currently selected variables a given distance 
in the direction toward the corresponding ordinary least squares solution. 

3. Repeat steps 1 and 2 until all variables have been added to the model. 

The FLASH algorithm is similar to that for LARS and Forward Selection. 
The main difference revolves around the distance that the coefficients are 
driven toward the least squares solution. For the Ith step in the FLASH al- 
gorithm this distance is determined by a tuning parameter, Si. Setting <5; = 
corresponds to the Lasso stopping rule, that is, driving the coefficients until 
the maximum of their absolute correlations intersect with that of another 
variable. Alternatively, 5i = 1 corresponds to the Forward Selection approach 
where the coefficients are set equal to the corresponding least squares solu- 
tion. However, setting <5/ = ^, for example, causes the coefficients to be driven 
half way between the Lasso and the Forward Selection stopping points. As a 
result, FLASH can adjust the level of shrinkage not just on the final model 
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coefficients, as used previously in, for example, the Relaxed Lasso, but also 
at each step during the selection of potential candidate models. 

Figure 3 illustrates potential coefficient paths, for the first two variables 
selected, for each of the three different approaches. The horizontal solid 
line in each plot shows the path for the first variable selected, j3\. The 
first plot illustrates Forward Selection where f3\ is driven all the way to 
the least squares solution, represented by the first cross. Alternatively, the 
Lasso (second plot) only drives f3\ a quarter of the way to the least squares 
solution. Finally, the third plot shows one possible FLASH solution. Here 
we have marked 5\ = for the Lasso solution and 5\ = 1 for the Forward 
Selection estimate. In this case we set 5\ = ^ and, hence, the corresponding 
FLASH estimate for /3i is half way between the Lasso and Forward Selection 
coefficients. The sloped solid line on each plot illustrates the continuation of 
the paths to estimate both j3\ and /?2- Again, Forward Selection drives /?i 
and 02 to their joint least squares solution, while the Lasso estimate only 
moves part way in this direction. The final plot shows the FLASH estimate, 
again setting 82 = \- 

In the following section we describe two different approaches for letting 
the data select the optimal level of shrinkage at each step. In some situations, 
for example, where a subset of the true variables has a high signal, we may 
wish to adopt the Forward Selection approach with no shrinkage. In other 
situations, for example, where there is a lot of noise, the highly shrunk Lasso 
estimates may be preferred. But, as we show in the simulation results, often a 
level of shrinkage between these two extremes gives superior results. Another 
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Fig. 3. Example coefficient paths for a two variable example using Forward Selection 
(crosses), the Lasso (triangles) and FLASH (circles). 
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strength of FLASH is that its coefficient path can be efficiently computed 
using a variant of the LARS algorithm, which we outline next. 

We use index I to denote each step of the algorithm, but for simplicity of 
the notation we omit this index wherever the meaning is clear without it. 
Throughout the algorithm index set A represents the correlations that are 
being driven toward zero, vector c_4 contains the values of these correlations, 
and Xj± denotes the matrix consisting of the columns of X associated with 
the set A. We refer to this set and the corresponding correlations as "active." 
Note that the active absolute correlations are driven toward zero at rates 
that are proportional to their magnitudes: 

1. Initialize /3 1 = 0, A = and I = 1. 

2. Update the active set A by including the index of the (new) maximal 
absolute correlation. Compute the | A | -dimensional direction vector h_4 = 
(A^A_4) _1 c^4. Let h be the p-dimensional vector with the components 
corresponding to A given by h_4, and the remainder set to zero. 

3. Compute jl, the Lasso distance to travel in direction h until a new ab- 
solute correlation is maximal. We provide the formulas in the Appendix, 
where we also show that 7_p, the Forward Selection distance to travel in 
direction h until the active correlations reach zero, equals one. Define 
7 = lL + Si(l - 7l) and let (3 l+1 = (3 l + 7 h. Set I <- I + 1. 

4. Repeat steps 2 and 3 until all correlations are at zero. 

Our attention has recently been drawn to the Forward Iterative Regression 
and Shrinkage Technique (FIRST) in Hwang, Zhang and Ghosal (2009), 
which can perform effectively in sparse high-dimensional settings. FIRST 
also utilizes aspects of the Forward Selection and Lasso approaches, but in a 
rather different fashion than FLASH. For example, in the orthogonal design 
matrix situation FIRST, when run to convergence, returns the Lasso fit, 
while FLASH still produces a continuum of solutions between those of the 
Lasso and Forward Selection. 

2.3. Modifications to the algorithm. In practice, we propose implement- 
ing FLASH with the following two modifications. First, note that when all 
81 are set to zero, the algorithm above reduces to the basic LARS algorithm, 
which does not necessarily recover the Lasso path. To ensure that FLASH is 
a generalization of the Lasso, we implement FLASH using the same modifi- 
cation as the LARS algorithm uses to compute the Lasso path, that is, if at 
any point on the path a coefficient hits zero, then the corresponding variable 
is removed from the active set. A detailed description of this modification is 
given in the Appendix. 

Second, to account for the potential over-shrinkage of the coefficients in 
a sparsely estimated model, we implement a "relaxed" version of FLASH, 
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which extends FLASH analogously to the way that the Relaxed Lasso ex- 
tends the Lasso. We unshrink each solution located at a breakpoint of the 
FLASH path, connecting it via a path with the ordinary least squares solu- 
tion on the corresponding set of variables. We do this as soon as the FLASH 
breakpoint is computed, in other words, right after the third step of the 
algorithm. As with the Relaxed Lasso, the calculation of the corresponding 
relaxation direction comes at no computational cost, as it coincides with the 
current direction of the FLASH path. More specifically, the original FLASH 
solution after step 3 is given by f3 l + jh, and the corresponding OLS solu- 
tion is given by (3 + h. The corresponding relaxation path is given by linear 
interpolation between these two points. 

For the remainder of this paper, when we refer to FLASH, we mean the 
modified version. In our numerical examples the final solution is selected 
via cross-validation as a point on one of the relaxation paths, where each of 
these continuous paths is replaced by its values on a fixed grid. 

2.4. Selection of tuning parameters. An important component of FLASH 
is the selection of the 5i parameters. Clearly, treating each 5i as an indepen- 
dent tuning parameter is not feasible. Many model selection approaches 
could be utilized. In this paper we investigate two possible approaches. The 
first, "global FLASH," involves selecting a single value, 5, for all the step 
sizes, that is, assuming a common level of shrinkage throughout the steps of 
the FLASH algorithm. Hence, 5 = corresponds to the Lasso and 5 = 1 to 
Forward Selection. Using this approach, we first choose a grid of <5's between 
and 1 and then select the value giving the lowest residual sum of squares 
on a validation data set or, alternatively, the lowest cross- validated error. 
Global FLASH has the advantage of only needing to select one 5, which 
improves its computational efficiency. 

The second approach, "block FLASH," allows for different values among 
the <5z's. However, to make the problem computationally feasible, we con- 
strain each 5i to be either zero or one. The version of block FLASH we 
focus on exclusively for the remainder of the paper involves selecting a sin- 
gle "break point" with 5i* = 1 and setting all remaining <5;'s to zero. This 
has the effect of dividing FLASH into two stages. In the first stage a series 
of Lasso steps (i.e., 5i = 0) are performed to select the initial variables. At 
the end of the first stage a Forward step (i.e., 5i* =1) is performed which 
has the effect of removing the coefficient shrinkage on the currently selected 
variables. In the second stage further variables are selected by performing 
a series of Lasso steps. As with global FLASH, block FLASH has the ad- 
vantage of only needing to select one tuning parameter, the break point. In 
Section 3 we provide simulation results for both versions of FLASH. In prac- 
tice, the two methods appear to perform similarly. However, as illustrated 
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below, we are able to establish some interesting theoretical properties for 
block FLASH. 

Note that for each fixed 5 or, correspondingly, each fixed I*, global and 
block FLASH both have the same computational cost as the LARS algo- 
rithm. Because LARS is extremely efficient, so are the FLASH algorithms, 
in particular, they require the same order of calculations as LARS if the grid 
size for 5 and the number of locations for I* are finite. We propose using a 
five value grid for 5, which worked very well in our simulation study. The 
upper bound on the number of potential locations for I* can be chosen based 
on the computational complexity of the problem. Remember that I* repre- 
sents the number of easily identifiable predictors, so one might reasonably 
expect a relatively low value. 

2.5. Theoretical arguments. In this section we present some variable se- 
lection properties of FLASH, in particular, conditions under which it can be 
shown to outperform the Lasso. Throughout this section "probability tend- 
ing to one" refers to the scenario of p going to infinity. For the standard case 
of bounded p, we could think of n going to infinity instead, although some 
minor modifications would need to be made to the statements of the results. 
Let K index the nonzero coefficients of (3. We will say that an estimator 
f3 recovers the correct signed support of (3 if sign(/3) =sign(/3), where the 
equality is understood componentwise. 

We will take a common approach of imposing bounds on the maximum 
absolute correlation between two predictors. Define S as the number of sig- 
nal variables, fi = maxj>fc |XjX^| and let £ be an arbitrarily small positive 
constant. The results of Zhao and Yu (2006) and Wainwright (2009) imply 
that if 

(2) minimi > c iy /S log p 

and \x < — with /xl = l/[25 — 1], then the Lasso solution correspond- 
ing to an appropriate choice of the tuning parameter will recover the correct 
signed support of f3 with probability tending to one. Here the constant c\ 
does not change with n and p, and its value is provided in the supplemental 
article [Radchenko and James (2010)]. On the other hand, the number of true 
nonzero coefficients, S, is allowed to grow together with n and p. Note that 
condition (2) is stated for the rescaled coefficients that correspond to the 
standardized predictor vectors. On the original scale the right-hand side in 
(2) would be of order sj (Slogp)/n. Suppose, for example, that S is bounded 
and p grows polynomially in n. In this case the lower bound on the mag- 
nitudes of the nonzero coefficients, expressed on the original scale, goes to 
zero at the rate \J (log n)/n. 

The correlation bound above is tight in the sense that for each fi> fX£ 
there are values of X T X and sign(/3) such that the Lasso fails to recover 
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the correct signed support. In the following claim we identify a class of such 
counterexamples . 

Claim 1. Let p be a constant satisfying p> jjll and let j be an arbitrary 
index in K c . Suppose that all the pairwise correlations among the predictors 
indexed by KL) {j} equal —p, and all the signs of the nonzero coefficients of 
P are negative. Then, with probability at least 1/2, no Lasso solution recovers 
the correct signed support of (3. 

The proof of the claim is provided in the supplemental article [Radchenko 
and James (2010)]. Note that for p < 1/S the correlation matrix can be easily 
made positive definite by setting all the nonspecified pairwise correlations 
to zero. 

Our Theorem 1 establishes that, under an additional assumption on the 
magnitudes of the nonzero coefficients, block FLASH can work in the situa- 
tions where the Lasso fails. The intuition behind this additional assumption 
is that for many regression problems there will be some signal variables that 
are relatively easy to identify, while the remainder pose more difficulties. 
The block FLASH procedure utilizes the first group of signal variables in a 
more efficient fashion and hence is better able to identify the remaining pre- 
dictors. To mathematically quantify this intuition, suppose that there exist 
indexes a and b, such that the corresponding true coefficients are nonzero 
and have a significant separation in the magnitudes, that is, a large value 
of \/3 a /f3b\- We will refer to the coefficients {/3j:\/3j\ > \/3 a \} as large, and 
the coefficients {/?,• :0 < \/3j \ < as small. Theorem 1 below states that 
if the ratio \/3 a //3b\ is sufficiently large, then the block FLASH procedure 
will correctly identify the signal variables under a weaker assumption on the 
maximal pairwise correlation. More specifically, at the first stage the proce- 
dure will identify all the large nonzero coefficients and not pick up any noise, 
and at the second stage it will pick up the remaining nonzero coefficients 
without bringing in the noise. As we discuss at the end of the section, the 
new correlation bound, pfl, is strictly larger than the Lasso bound, m,. 

Theorem 1. Suppose that condition (2) holds, inequality \(3 a /Pb\ > c^^/S 
is satisfied for an arbitrary pair of true nonzero coefficients, and p < /Ufl(1 — 
£) for some arbitrary constant £. Then, with an appropriate choice of the 
tuning parameters, the block FLASH estimator recovers the correct signed 
support of (3 with probability tending to one. 

Here the constant C3 does not change with n and p, and its value is pro- 
vided in the supplemental article [Radchenko and James (2010)] together 
with the proof of the theorem. Like the corresponding Lasso result in Wain- 
wright (2009), our theorem can handle subgaussian errors, that is, the tails 
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of the error distribution are required to decay at least as fast as those of a 
gaussian distribution. Relative to the Lasso result, the only new assumption 
is on the separation between the large and the small coefficients. Conse- 
quently, we are able to relax the requirement on the pairwise correlations. 
According to Claim 1, the new assumption does not allow us to relax the 
pairwise correlation requirement for the Lasso, as the nonzero coefficients of 
(3 affect the claim only through their signs. Applying Theorem 1 in the setup 
of the claim yields that the correct signed support of f3 can be recovered for 
all p < ^fl- In other words, under an additional assumption on the magni- 
tudes of the nonzero coefficients, block FLASH succeeds for pG [a*l,A*fl)) 
where the Lasso fails. 

The correlation bound in Theorem 1 can be taken as 



Here q\ and q<i are the fractions of large and small coefficients, respec- 
tively, among all the nonzero coefficients. More specifically, q\ = \{/3j : \j3j\ > 
\0a\}\/S and q 2 = |{/9,-:0 < \/3j\ < \/3 b \}\/S. Observe that p FL > Ml when 
q\S > 1. In fact, the proof of Theorem 1 reveals that the best possible value 
of /ifl is strictly above m, for all positive 31. 

3. Simulation results. In this section we present a detailed simulation 
study comparing FLASH to five natural competing approaches. We imple- 
mented both the global (FLASHg) and the block (FLASHb) versions of 
our method discussed in Section 2.4. The tuning parameter 6 in FLASHg 
was selected from a grid of five possible values, {0, 0.25, 0.5, 0.75, 1}. We also 
tried a {0, 1} grid corresponding to the Lasso and Forward Selection, and 
a {0, 0.5, 1} grid, but the results were inferior, so we do not report them 
here. 

We compared FLASH to VISA, the Relaxed Lasso (Relaxo), the Adaptive 
Lasso (Adaptive), Forward Selection (Forward) and the Lasso. The Adaptive 
Lasso involves a preliminary step where the weights are typically chosen by 
performing a least squares fit to the data. This is not feasible for p > n, so we 
selected the weights using either the simple linear regression fits, as suggested 
in Huang, Ma and Zhang (2008), or a ridge regression fit, as suggested in 
Zou (2006). The ridged fits dominated so we only report results for the latter 
method here. 

Our simulated data consisted of five parameters which we varied: the 
number of variables (p = 100 or p = 200), the number of training observa- 
tions (n = 50, n = 70 or n = 100), the correlations among the columns of the 
design matrix (p = or p = 0.5), the number of nonzero regression coeffi- 
cients (S = 10 or S = 30) and the standard deviation among the coefficients 



(3) 




FORWARD-LASSO WITH ADAPTIVE SHRINKAGE 



13 



(erg = 0.5, erg = 0.7 or op = 1). We tested most combinations of the parame- 
ters and report a representative sample of the results. The rows of the design 
matrix were generated from a mean zero normal distribution with a corre- 
lation matrix whose off-diagonal elements were equal to p. The error terms 
were sampled from the standard normal distribution, while the regression 
coefficients were generated from a mean zero normal with variance <r|. For 
each simulated data set we randomly generated a validation data set with 
half as many observations as the training data and selected the various tun- 
ing parameters for each method as those that gave the lowest mean squared 
error between the response and predictions on the validation data. In partic- 
ular, both the relaxation parameter and the number of steps in the algorithm 
for the FLASH methods and the Relaxed Lasso were selected using a val- 
idation set. For each method and simulation we computed three statistics, 
averaged over 200 data sets: False Positive, the number of variables with 
zero coefficients incorrectly included in the final model; False Negative, the 
number of variables with nonzero coefficients left out of the model; and L2 
square, the squared L2 distance between the estimated coefficients and the 
truth. Table 1 provides the results. 

The first four simulations corresponded to p = 0, while the next four were 
generated using p = 0.5. The ninth simulation was a denser case with S = 30. 
Finally, the last four simulations represent harder problems with or = 0.7 or 
0.5, reducing the signal to noise ratio from 10 to 4.9 and 2.5, respectively. 
For the L2 square statistic we performed tests of statistical significance, 
comparing each method to the best FLASH approach. For each simulation 
we placed in bold the L2 square value for the best method and any other 
method that was not statistically worse at the 5% level of significance. For 
example, in the first simulation with 100 variables and 100 observations 
both versions of FLASH and Forward were statistically indistinguishable 
from each other. However, in the third simulation with 100 variables and 
50 observations FLASHq was statistically superior to all other methods. 
Most of the standard errors for the L2 square statistic were relatively low, 
approximately 4% of the statistic's value. However, as has been observed 
previously, we found that the Forward method often gave more variable 
estimates than the other approaches, with some standard errors as high as 
8% of the statistic's value. 

None of the thirteen simulations contained a situation where one of the 
competing methods was statistically superior to FLASH, while in ten of the 
simulations FLASH was statistically superior to all other methods. In gen- 
eral, Forward Selection performed well in the easiest scenarios with large n, 
zero correlation, p, and higher signal, op = 1. In particular, Forward Selec- 
tion performed very poorly in the denser S = 30 scenario, while this was a 
favorable situation for the Lasso. FLASH was still superior to both methods 
in this simulation setup. The Adaptive Lasso, VISA and Relaxo all provided 
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Table 1 

Simulation results for each method. L2 square denotes the squared L2 distance between 
the estimated coefficients and the truth, averaged over the 200 simulated data sets. For 
each simulation scenario we placed in bold the best L2 square value together with the L2 
square value for any other method that was not statistically worse at the 5% level of 

significance 



Simulation 
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FLASHb 
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1.89 
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0.436 
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1.99 
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0.775 


0.848 


0.996 
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1.228 


0.929 


1.285 
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50, p = 


200 


False-Pos 


3.73 


7.24 


6.46 


6.84 


12.89 


1.71 


18.54 


s= 


10, p = 





False-Neg 


3.83 


3.4 


4.06 


4.04 


3.81 


4.57 


3.04 




= 1 




L2-sq 


1.057 


1.089 


1.477 
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1.934 
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0.546 
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0.797 
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100, p = 
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False- Pos 


3.35 


6.35 


7.06 


7.33 


11.82 
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21.72 


s= 


10, p = 


0.5 


False- Ncg 


3.12 


2.88 


3.06 


3.11 


3.01 


3.57 


2.23 


o-p ~- 


= 1 




L2-sq 


0.608 


0.673 


0.752 


0.785 


0.872 


0.655 


1.029 


n — 


50, p = 


100 


False- Pos 


5.12 


8.31 


7.23 


7.44 


11.27 


2.42 


16.2 


S = 


10, p = 


0.5 


False-Neg 


3.95 


3.38 


3.79 


3.88 


3.53 


4.82 


2.99 


o-p = 


= 1 




L2-sq 


1.732 


1.743 


1.84 


1.901 


2.088 


2.38 


2.199 


n — 


50, p = 
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False-Pos 


5.82 


9.62 


8.8 


8.77 
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2.37 


18.28 


S = 
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0.5 


False-Neg 


5.14 


4.45 


5.04 


5.12 


4.9 


6.25 


4.34 


CT/3 = 


= 1 




L2-sq 


2.399 


2.35 


2.648 


2.7 


2.851 


3.094 


2.934 


n — 
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False-Pos 


10.6 


13.73 


11.34 


12.09 


15.62 


4.39 


17.16 
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14.23 
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21.7 


11.95 


o-p ~- 
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L2-sq 


10.559 


9.051 


10.749 
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False-Pos 


3.54 


4.72 


6.09 


6.19 
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1.84 


17.86 


s= 
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0.5 


False-Neg 


3.73 


3.54 


3.51 


3.56 


3.34 


4.24 


2.46 
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L2-sq 


0.625 


0.624 


0.692 


0.705 


0.724 
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False-Pos 


4.06 
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7.48 
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1.68 


21.46 
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0.5 
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3.81 
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3 
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L2-sq 


0.686 


0.731 


0.788 


0.794 


0.799 


0.77 


0.922 


n — 


100, p = 


= 100 


False-Pos 


3.81 


4.88 


6.11 


5.93 


9.65 


1.99 


15.78 


S = 


10, p = 


0.5 


False-Neg 


4.74 


4.49 


4.46 


4.51 


4.16 


5.48 


3.42 


CT/3 = 


= 0.5 




L2-sq 


0.559 


0.545 


0.576 


0.584 


0.596 


0.683 


0.633 


n — 


100, p~- 
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False-Pos 


3.69 


5.25 


5.76 


6.13 


10.11 


1.54 


17.73 


s= 


10, p = 


0.5 


False-Neg 


5.38 


5.08 


5.29 


5.32 


4.88 


6.03 


4.24 


CT/3 = 


= 0.5 




L2-sq 


0.664 


0.662 


0.696 


0.709 


0.715 


0.761 


0.769 
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improvements over the Lasso, though the latter two methods generated the 
largest increase in performance. The two versions of FLASH performed at 
a similar level, though FLASHg seemed slightly better in the sparser cases, 
while FLASHb was superior in the denser S = 30 situation. FLASHg also 
required less computational effort, because its path only needed to be com- 
puted once for each of the five potential values of 8. 

Overall, Forward Selection had low false positive but high false nega- 
tive rates. In comparison to VISA and Relaxo, FLASHg had the lowest 
false positive rates and similar or lower false negative rates. Alternatively, 
FLASHb had very low false negative rates and similar false positive rates. 
Overall, FLASH selected sparser models than VISA, the Relaxed Lasso and 
the Adaptive Lasso, and significantly sparser models than the Lasso. 

4. Extending to generalized linear models. 

4.1. Methodology. In the generalized linear model framework for a re- 
sponse variable, Y, with distribution 

p(y, o,4>) = exp f ~^p + c ( y ' > 

one models the relationship between predictor and response as g(fii) = 
Y?j=iXij/3j, where fjn = E(Y;6i,(j>) = b'(8i), and g is referred to as the link 
function. Common examples of g include the identity link used for normal 
response data and the logistic link used for binary response data. For no- 
tational simplicity we will assume that g is chosen as the canonical link, 
though all the ideas generalize naturally to other link functions. The co- 
efficient vector j3 is generally estimated by maximizing the log likelihood 
function, 

n 

(4) l(P) = Y,(YiXf(3-b(X.Jp)). 

i=l 

However, when p is large relative to n, the maximum likelihood approach 
suffers from problems similar to those of the least squares approach in linear 
regression. First, maximizing (4) will not produce any coefficients that are 
exactly zero, so no variable selection is performed. As a result, the final 
model is less interpretable and probably less accurate. Second, for large p 
the variance of the estimated coefficients will become large and when p> n, 
function (4) has no unique minimum. 

Various solutions have been proposed. Park and Hastie (2007) discuss a 
natural GLM extension of the Lasso (GLasso) where, for a fixed A, they 
choose (3 to minimize 

(5) l(f3,X) = -l((3) + X\\(3\\i- 
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The coefficient paths for the GLasso are not generally piecewise linear, but 
Park and Hastie present an algorithm for approximating the true path. For- 
ward Selection can also be easily extended to the GLM domain by starting 
with an empty set of variables and, at step I, adding to the model the variable 
that maximizes the jth partial derivative of the log likelihood, lj((3i). One 

then sets 0i + i equal to the maximum likelihood solution corresponding to 
the currently selected variables and repeats. Note that = Xj(Y — /}), 

which is just the correlation between the jth predictor and the residuals. 
When using Gaussian errors with the identity link function, fi = Xj3, so 
this algorithm reduces back to standard Forward Selection in the regression 
setting. 

The GLM versions of the Lasso and Forward Selection also suggest a nat- 
ural extension of FLASH to this domain. In the GLM FLASH algorithm we 
again start with an empty set of variables, A\, and /3 1 = 0. Then at step I 
we add to the model the variable that maximizes the jth partial derivative 
of the log likelihood, l'j(Pi), that is, the variable with maximal correlation. 

Finally, we drive f3[ + i in the direction toward the maximum likelihood solu- 
tion with the distance determined by r5j. Again, 5i = corresponds to shifting 
Pi as far as the Lasso stopping point, while 5i = 1 represents the maximum 
likelihood solution. However, one key difference between the GLM and stan- 
dard versions of FLASH is that, because the coefficient paths are no longer 
piecewise linear, the coefficients do not move in a linear fashion toward the 
maximum likelihood solution. 

Figure 4 provides a pictorial example in the same two variable domain as 
for Figure 3. The GLasso still significantly shrinks the coefficients relative 



Forward Lasso FLASH 




5i = S-i = i (5i = 1 

Pi Pi Pi 

Fig. 4. Example coefficient paths for a two variable GLM example using Forward Selec- 
tion (crosses), the Lasso (triangles) and FLASH (circles). 
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to the Forward Selection approach. Alternatively, FLASH provides an in 
between level of shrinkage. However, notice that the coefficient paths now 
move in a curved fashion toward the maximum likelihood solution. It is 
possible to compute this nonlinear path on a grid of tuning parameters, and 
we present the precise algorithm in the Appendix. 

The block FLASH approach is particularly appealing in the GLM set- 
ting, because it is both conceptually simple and easy to implement. With 
this method FLASH follows the GLasso path for the first I* — 1 steps, that 
is, 5± = §2 = ■ ■ ■ = Si*— i = 0. At this point the maximum likelihood solution 
for the currently selected variables is computed, that is, <5/* = 1. Finally, the 
GLasso path is followed again with zero penalty on the variables correspond- 
ing to Ai*, that is, <5;*+i = • • • = 0. We compute the GLM version of block 
FLASH using two implementations of the R function glmnet(-) [Friedman, 
Hastie and Tibshirani (2010)], which uses a coordinate descent algorithm to 
minimize (5). We first use glmnet(-) to compute the path prior to I* and 
then make a second call to the function to compute the path after I*, placing 
zero penalty on the variables selected in the first step. 

4.2. Simulation study. In this section we provide a simulation compar- 
ison of the block GLM FLASH method with several other standard GLM 
approaches. In particular, we compared FLASH to "GLasso," "GRelaxo," 
"GForward" and the standard "GLM." GLasso is implemented using the R 
function glmnet(-). GRelaxo takes the same sequence of models suggested 
by GLasso but unshrinks the final coefficient estimates using a standard 
GLM fit to the nonzero coefficients. GForward uses the approach outlined 
previously. 

We simulated responses from the Bernoulli distribution using the logistic 
link function. The data were generated with p = 100 variables, but we in- 
creased the sample size to n = 400 as the Bernoulli response provided less 
information compared to the Gaussian response. The correlation among the 
predictors was set to either p = or p = 0.5, and the number of true signal 
variables was set to either S = 10 or S = 15. Finally, the nonzero regression 
coefficients were randomly sampled from either a point mass distribution, 
with probability a half of being 0.5 or —0.5, or the standard normal distri- 
bution. The tuning parameters for all methods were selected as those that 
minimized the "deviance" on a validation data set with n = 200 observa- 
tions. In all other respects the simulation setup was the same as the one we 
used in the linear regression setting. 

The results from five different simulations are provided in Table 2. Stan- 
dard GLM performs very poorly. Note we have reported the median errors 
for this method because the algorithm did not converge properly for some 
simulations. GForward was competitive with FLASHb when using uncorre- 
cted predictors but deteriorated in the correlated situation. In all scenarios 
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Table 2 



Simulation results for each method using a Bernoulli response. L2 square denotes the 
squared L2 distance between the estimated coefficients and the truth, averaged over the 
200 simulated data sets. For each simulation scenario we placed in bold the best L2 





square value together with the L2 square 


value for 


any other 


method that 


was not 






statistically 


worse at tht 


1 5% level 


of significance 




Simulation 


Statistic 


FLASHb 


GRelaxo 


GLasso 


GForward 


GLM 


n 


= 400, p = 100 


False-Pos 


4.38 


4.47 


16.61 


1.18 


90 


S 


= 10, ,9 = 


False-Neg 


0.34 


0.45 


0.06 


0.55 





P 


= ±0.5 


L2-sq 


0.461 


0.488 


0.71 


0.454 


6.065 


n 


= 400, p = 100 


False-Pos 


9.1 


10.11 


15.57 


1.54 
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= 10, p = 0.5 


False-Neg 


1.69 


1.82 


0.92 
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L2-sq 
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1.415 


10.559 
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S 
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3.07 
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0.72 
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1.954 


0.668 
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2.17 
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0.7 


85 


S 


= 15, p = 0.5 


False-Neg 


5.57 


5.38 


3.66 
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P 


= iV(0,l) 


L2-sq 


2.302 


2.692 


4.211 


2.487 


11903.88 



FLASHb either had the lowest L2-sq statistic or was not statistically differ- 
ent from the best. In the last two simulations it was statistically superior to 
all the other approaches. 

5. Empirical analysis. We implemented the global, block and GLM ver- 
sions of FLASH on three different real world data sets. The first contained 
salaries of professional baseball players (obtained from StatLib, Department 
of Statistics, CMU). For each player a number of statistics were recorded, 
such as career runs batted in, walks, hits, at bats, etc. We then used these 
variables to predict salaries. After including all possible interaction terms, 
the data set contained n = 263 observations and p = 153 predictors. 

We tested three competitors to FLASH, namely, Lasso, Forward Selection 
and the Relaxed Lasso. For each of the four methods ten-fold cross-validation 
was used to compute the root mean squared error (RMSE) in prediction 
accuracy at various points of the coefficient path. The final results are illus- 
trated in Figure 5(a). The open circles represent the Lasso RMSE's evaluated 
at the break points of the LARS algorithm. Alternatively, the solid circles 
show the errors corresponding to the least squares fits for the models se- 
lected by the Lasso. The dashed line that connects the open and the solid 
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Fig. 5. (a) The cross-validated root mean squared error plotted versus the number of 
steps in the corresponding algorithm for four different methods in the example of predicting 
baseball players' salary. The methods displayed are Forward Selection (dotted line), Lasso 
(open circles), OLS fits corresponding to the Lasso models (solid circles), Relaxed Lasso 
(dashed line, connecting the open and the solid circles) and FLASH with 5 = 0.25 
black line), (b) Same as (a) with more steps. 



circles illustrates the Relaxed Lasso fit as the coefficient shrinkage is reduced 
from the Lasso estimate (maximum shrinkage) to the least squares fit (no 
shrinkage). The dotted line corresponds to Forward Selection. Finally, the 
black solid line represents the global FLASH fit with 5 = 0.25. We have fixed 
the value of the 5 parameter to ensure fair comparison with other methods 
on the basis of the cross- validated RMSE. In our simulations we used a five 
point grid, where the end points corresponded to the Relaxed Lasso and For- 
ward Selection, respectively. Among the remaining three values, we decided 
to pick 5 = 0.25, as it is closer to the Relaxed Lasso, which is in general a 
more reliable method then Forward Selection. 

From step 15 onward, Forward Selection begins to significantly deterio- 
rate, while FLASH continues to improve and eventually achieves the lowest 
error rate of all four methods at approximately step 20. Figure 5(b) plots 
the cross-validated error paths out to 80 steps. The Relaxed Lasso achieves 
its optimal results at approximately step 50, which corresponds to a 34 vari- 
able model. Not only is the optimal error rate worse than that for FLASH, 
but the corresponding model contains twice as many variables as the model 
selected by FLASH, which only had 17 variables. Our simulation results 
pointed to a similar phenomenon, and we have noticed in other real data 
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Table 3 

Mean squared errors, averaged over 100 test data sets, and 
average number of variables selected, for the Boston Housing 

data 





FLASH 


Relaxo 


Lasso 


Forward 


MSE 


27.01 


28.30 


29.56 


33.03 


Number of coefficients 


18.93 


17.13 


26.99 


16.8 



sets that FLASH tends to select sparser models, suggesting FLASH may 
have an advantage in terms of inference in addition to prediction accuracy. 

The second data set we examined was the Boston Housing data, commonly 
used to compare different regression methods. After including interaction 
terms this data contained 90 predictors of the average house value in 506 
locations. To test the p ~ n scenario, 90 observations were randomly sampled 
for the training data, 45 observations for a validation data set, and the 
remainder for the testing data. We implemented the block FLASH approach 
and compared it to the Lasso, Relaxo and Forward Selection. Least squares 
fits were used for the final coefficient estimates on all methods except the 
Lasso. Hence, for example, the Relaxed Lasso solutions simplify to the OLS 
solutions computed for the sequence of models specified by the Lasso. Each 
approach was fitted using the training data, with the tuning parameters 
chosen using the validation data, and then the mean squared error was 
computed on the test data. This procedure was repeated using 100 different 
random samplings of the data, to average out any effect from the choice of 
test sets. Table 3 shows, for each method, the average mean squared error as 
well as the average number of coefficients chosen in the final model. Block 
FLASH achieves the lowest mean squared error. In addition, FLASH, Relaxo 
and Forward Selection all choose significantly smaller models than the Lasso, 
making their results more interpretable. On average, block FLASH selected 
8.67 variables before implementing the Forward step. FLASH resulted in 
lower MSE than Relaxo in 62 random splits of the data, the two methods 
had the same MSE in 3 splits, and FLASH had a higher MSE in 35 of 
the splits. The corresponding numbers comparing FLASH to the Lasso and 
FLASH to Forward Selection were 63/0/37 and 81/0/19. 

The final data set that we examined was the internet advertising data 
available at the UC-Irvine machine learning repository. The response was 
categorical, indicating whether or not each image was an advertisement. 
The predictors recorded the geometry of the image as well as whether certain 
phrases occurred in and around the image url's. After preprocessing, the 
data set contained n = 2359 observations and p = 1430 variables. The large 
value of p presented significant statistical and computational difficulties for 
standard approaches, with the glm(-) function in R taking almost fifteen 
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minutes to run and producing NA estimates for most coefficients. However, 
we were able to implement GLM block FLASH, randomly assigning two- 
thirds of the observations to the training data set and the remainder to the 
validation data set. FLASH selected a twenty-seven variable model, with 
the five most important variables, in terms of the order that they entered 
the model, being the width of the image, whether the image's anchor url 
contained the phrase "com," whether the url contained the phrase "ads," 
and whether the anchor URL contained the phrases "click" and "adclick." We 
also fixed the tuning parameters at the selected values and used a bootstrap 
analysis to produce pointwise confidence intervals on the coefficients. GLM 
block FLASH ran very efficiently, taking approximately twenty seconds to 
produce the corresponding estimator on each bootstrapped data set. The 
misclassification error on the validation set was 2.9% for FLASH, while it 
was 3.2%, 4.0% and 5.0%, for GRelaxo, GLasso and GForward, respectively. 
The sizes of the models selected by the last three methods were 38, 77 and 
16, respectively. 

6. Discussion. The main difference between Forward Selection and the 
Lasso is in the amount of shrinkage used to iteratively estimate the regression 
coefficients. For any given data set there is no particular reason that either 
the zero shrinkage of Forward Selection or the extreme shrinkage of the 
Lasso must produce the best solution. FLASH allows the data to dictate 
the optimal level of shrinkage at the model selection stage. This is quite 
different from approaches such as Relaxo that adjust the level of shrinkage 
after the model has been selected but not while choosing the sequence of 
models to consider. As a result, FLASH often produces sparser models with 
superior predictive accuracy. 

Computational efficiency is always important for high-dimensional prob- 
lems. The standard FLASH algorithm is very similar to LARS and hence 
involves a relatively small computational expense. In addition, the block 
FLASH approach can easily be formulated as a penalized regression prob- 
lem with the usual L\ penalty before the Forward step and zero penalty on 
certain variables after this step. Hence, even more efficient methods, such as 
the recent work on pathwise coordinate descent algorithms [Friedman et al. 
(2007)], can be used to compute the path, not only for regression problems, 
but also for our extension to GLM data. Indeed, the glmnet(-) function that 
we used to fit GLM block FLASH utilizes a coordinate descent algorithm. 

APPENDIX A: STEP 3 OF THE FLASH ALGORITHM 

Let be one of the active correlations with the maximum absolute value. 
Then, as with LARS, the first time a nonactive absolute correlation reaches 
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the "active" maximum corresponds to the step size of 

C^* Cj Cj* ~\~ Cj 



= mm 



jsA°\(X*-Xj) T Xh' (Xi» + Xj) T Xh J ' 

where the minimum is taken over the positive components. 

Along the direction h, all active correlations reach zero at the same time. 
Hence, the Forward Selection step size is given by 

1F = Xph = *lX A {X T A X A )-^c A = ■ 



APPENDIX B: ZERO CROSSING MODIFICATION 

The basic FLASH algorithm described in Section 2.2 shares the following 
property with the basic LARS algorithm: once a variable enters the model, it 
does not leave. Recall that the Lasso solution path can be obtained from the 
modified LARS algorithm, where if a coefficient hits zero, the corresponding 
variable is removed from the active set, and hence the model as well. When 
a variable is removed, the corresponding absolute correlation goes below the 
value it would be at if it remained active. The variable rejoins the model if its 
absolute correlation reaches the value it would be at, had the variable stayed 
in the model. We provide a similar modification to the FLASH algorithm. 

Definition 1 (Zero crossing modification). When a coefficient hits zero, 
the corresponding variable is removed from the active set. The variable is 
added back to the active set once the corresponding absolute correlation 
reaches the value it would currently be at had it remained active. Also, 
while the variable is out of the active set, it is ignored in the calculation of 
the maximum absolute correlation in step 2 of the FLASH algorithm. 

In LARS it is easy to keep track of what the absolute correlation value 
would be if the removed variable remained active: it is just the value of the 
maximum absolute correlation. In FLASH this value is also easy to keep 
track of, because all pairwise ratios among the active absolute correlations 
stay fixed throughout the algorithm. 

APPENDIX C: PATH ALGORITHM FOR THE GLM FLASH 

By analogy with LARS and GLasso, the GLM FLASH algorithm pro- 
gresses in piecewise linear steps. Our algorithm is a modification of the one 
in Park and Hastie (2007). Throughout the algorithm we write A for the vec- 
tor of absolute correlations between the predictors and the current residuals, 
\X (Y — We start with (3 = 0, ft = Yl, and the active set, A, consisting 
of j* = argmaxAj. We decrease the value of ||A^||oo from \Xj*(Y — Yl)\ to 
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zero along a data dependent grid. At each grid point we take the following 
four steps. The details of steps 1 and 2 are discussed in Park and Hastie 
(2007): 

1. Predictor. Linearly approximate the solution to (6); call it f3. 

2. Corrector. Use /3 as the warm start to produce /3, the exact solution to 

(6) min (-l((3) + J2m 

3. If HA^cjloo > 1 1 A^4 1 1 co or min^l/jjl =0, set <— (1 — 5)X_a and repeat 
steps 1-2. 

4. Let A z contain the indices of the zero coefficients in A. If HA^H^ > 
||A^||oo) augment A with j* = argmax^c Xj. Set A <— A \ A z . 

5. Set A_4 <— (1 — e)A_4 for some small e. 

Note that setting 5 = recovers the GLasso algorithm in Park and Hastie, 
while setting 5 = 1 results in the path for GForward. 

Acknowledgments. We would like to thank the Editor, Associate Editor 
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SUPPLEMENTARY MATERIAL 

Forward-LASSO with adaptive shrinkage (DOI: 10.1214/10-AOAS375SUPP; 
.pdf). This material contains the proofs of Claim 1 and Theorem 1. 
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